Denoising Diffusion Implicit Models (DDIM)

This is a learning note based on this series of videos.

Paper Link: https://arxiv.org/abs/2010.02502

1. Background

Background:

DDPM Pro: high-quality image generation

DDPM Con: requires many steps for sampling

Goal:

Accelerate sampling process

How:

Generalize the forward diffusion process used by DDPMs, which is Markovian, to non-Markovian ones. The resulting training objective is the same as the objective used in DDPM. Therefore, we can choose from existing generative models and only modify the sampling process.

DDIM can massively increase sampling efficiency at only a minor cost in sample quality.

Empirical Benefits of DDIMs over DDPMs:

  1. Sampling can be accelerated by 10x to 100x.
  1. Sample Consistency: when the initial latent variable is fixed and several samples are generated with Markov chains of various lengths, the resulting samples share similar high-level features.
  1. Due to consistency, DDIMs can perform semantically meaningful image interpolation by manipulating the initial latent variables.

2. Denoising Objective of DDPM

Suppose the true data distribution is $q(\bold{x}_0)$; we want to learn an approximate distribution $p_{\theta}(\bold{x}_0)$ that is close to $q(\bold{x}_0)$.

DDPMs are latent variable models of the form

$$p_{\theta}(\bold{x}_0) = \int p_{\theta} (\bold{x}_{0:T})\, d \bold{x}_{1:T}, \quad \text{where} \quad p_{\theta} (\bold{x}_{0:T}) \coloneqq p_{\theta} (\bold{x}_T) \prod_{t=1}^T p_{\theta}^{(t)}(\bold{x}_{t-1} | \bold{x}_t)$$

where $\bold{x}_1, \dots, \bold{x}_T$ are latent variables. The parameters $\theta$ are learned to fit the data distribution $q(\bold{x}_0)$ by maximizing a variational lower bound:

$$\max_{\theta} \mathbb{E}_{q(\bold{x}_0)}[\log p_{\theta}(\bold{x}_0)] \geq \max_{\theta} \mathbb{E}_{q(\bold{x}_0, \bold{x}_1, \dots, \bold{x}_T)} [\log p_{\theta}(\bold{x}_{0:T}) - \log q(\bold{x}_{1:T} | \bold{x}_0)]$$

where $q(\bold{x}_{1:T}|\bold{x}_0)$ is some inference distribution (the forward process). DDPM considers it as the following Markov chain with Gaussian transitions parameterized by a decreasing sequence $\alpha_{1:T} \in (0, 1]^T$:

$$q(\bold{x}_{1:T}|\bold{x}_0) \coloneqq \prod_{t=1}^T q(\bold{x}_t | \bold{x}_{t-1}), \quad \text{where } q(\bold{x}_t | \bold{x}_{t-1}) \coloneqq \mathcal{N} \left( \sqrt{\frac{\alpha_t}{\alpha_{t-1}}} \bold{x}_{t-1}, \left(1 - \frac{\alpha_t}{\alpha_{t-1}}\right) \bold{I} \right)$$

A special property of the forward process is that:

$$q(\bold{x}_t | \bold{x}_0) \coloneqq \int q(\bold{x}_{1:t} | \bold{x}_0)\, d \bold{x}_{1:(t-1)} = \mathcal{N} (\bold{x}_t; \sqrt{\alpha_t} \bold{x}_0, (1 - \alpha_t) \bold{I})$$

Note: we need to make sure $\alpha_T \rightarrow 0$, so that the data becomes (approximately) standard Gaussian after adding noise.

By the reparameterization trick, $\bold{x}_t$ can be expressed as a linear combination of $\bold{x}_0$ and a noise variable $\epsilon$:

$$\bold{x}_t = \sqrt{\alpha_t} \bold{x}_0 + \sqrt{1 - \alpha_t} \epsilon, \qquad \text{where} \quad \epsilon \sim \mathcal{N}(\bold{0}, \bold{I})$$
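
To make this closed-form expression concrete, here is a minimal numpy sketch (with an arbitrary linear schedule for $\alpha_t$, chosen purely for illustration) checking that simulating the Markov chain step by step and sampling $\bold{x}_t$ directly from $q(\bold{x}_t | \bold{x}_0)$ give the same mean and variance:

```python
import numpy as np

rng = np.random.default_rng(0)

T = 1000
# DDIM notation: alpha_t is the *cumulative* product (alpha_bar in DDPM notation).
# A simple decreasing schedule is assumed for illustration.
alpha = np.linspace(1.0, 0.0, T + 2)[1:-1]   # alpha_1 ... alpha_T in (0, 1)

d = 4                       # data dimension
x0 = rng.normal(size=d)     # a fixed "data" point

def forward_chain(x0, t):
    """Simulate q(x_t | x_{t-1}) step by step."""
    x, a_prev = x0.copy(), 1.0                # alpha_0 := 1
    for s in range(t):
        a = alpha[s]
        x = np.sqrt(a / a_prev) * x + np.sqrt(1 - a / a_prev) * rng.normal(size=d)
        a_prev = a
    return x

def forward_marginal(x0, t):
    """Sample directly from the closed-form marginal q(x_t | x_0)."""
    a = alpha[t - 1]
    return np.sqrt(a) * x0 + np.sqrt(1 - a) * rng.normal(size=d)

t = 500
chain = np.stack([forward_chain(x0, t) for _ in range(2000)])
marg = np.stack([forward_marginal(x0, t) for _ in range(2000)])
print(chain.mean(0), marg.mean(0))   # both close to sqrt(alpha_t) * x0
print(chain.var(0), marg.var(0))     # both close to 1 - alpha_t
```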

If all the conditionals are modeled as Gaussians with trainable mean functions and fixed variances, the variational lower bound objective can be simplified to:

$$L_{\gamma} (\epsilon_{\theta}) \coloneqq \sum_{t=1}^T \gamma_t \mathbb{E}_{\bold{x}_0 \sim q(\bold{x}_0), \epsilon_t \sim \mathcal{N}(\bold{0}, \bold{I})} \left[ ||\epsilon_{\theta}^{(t)} (\sqrt{\alpha_t} \bold{x}_0 + \sqrt{1 - \alpha_t} \epsilon_t) - \epsilon_t||_2^2 \right]$$

where $\epsilon_{\theta} \coloneqq \{ \epsilon_{\theta}^{(t)}\}_{t=1}^T$ is a set of $T$ functions; each $\epsilon_{\theta}^{(t)}: \mathcal{X} \rightarrow \mathcal{X}$ has trainable parameters $\theta^{(t)}$, and $\gamma \coloneqq [\gamma_1, \dots, \gamma_T]$ is a vector of positive coefficients in the objective that depends on $\alpha_{1:T}$.
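
As a concrete illustration, here is a minimal PyTorch sketch of training with this objective for $\gamma_t = 1$. The tiny MLP, the schedule, and all names are placeholders of my own; in practice a time-conditioned U-Net is used.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

T, d = 1000, 8
alpha = torch.linspace(0.999, 1e-4, T)   # cumulative alphas; assumed schedule

class EpsModel(nn.Module):
    """Toy stand-in for the noise-prediction network eps_theta^{(t)}."""
    def __init__(self, d):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(d + 1, 64), nn.ReLU(), nn.Linear(64, d))

    def forward(self, x_t, t):
        t_feat = t.float().unsqueeze(-1) / T          # crude time conditioning
        return self.net(torch.cat([x_t, t_feat], dim=-1))

model = EpsModel(d)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

for step in range(100):
    x0 = torch.randn(32, d)                           # stand-in for a data batch
    t = torch.randint(1, T + 1, (32,))                # uniform timestep
    a_t = alpha[t - 1].unsqueeze(-1)
    eps = torch.randn_like(x0)
    x_t = a_t.sqrt() * x0 + (1 - a_t).sqrt() * eps    # closed-form forward step
    loss = ((model(x_t, t) - eps) ** 2).mean()        # L_gamma with gamma_t = 1
    opt.zero_grad()
    loss.backward()
    opt.step()
```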

3. Why DDIM Uses a Non-Markovian Forward Process

Note that in the DDPM loss, $||\epsilon_{\theta}^{(t)} (\sqrt{\alpha_t} \bold{x}_0 + \sqrt{1 - \alpha_t} \epsilon_t) - \epsilon_t||_2^2$ relies only on the "overall step" distribution $q(\bold{x}_t | \bold{x}_0)$, but not directly on $q(\bold{x}_{1:T} | \bold{x}_0)$.

There are many choices of inference distribution $q(\bold{x}_{1:T} | \bold{x}_0)$ that have the same $q(\bold{x}_t | \bold{x}_0)$.

If we keep the overall step $q(\bold{x}_t | \bold{x}_0)$ unchanged, then the objective $L_{\gamma}$ does not need to change (we don't need to train the model again).

Non-Markovian Forward Processes for a Discrete Case

The non-Markovian forward process also works beyond the Gaussian case (Appendix A).

Consider a categorical observation $\bold{x}_0$ that is a one-hot vector with $K$ possible values. The forward process is defined as follows. First, we define $q(\bold{x}_t | \bold{x}_0)$ as the following categorical distribution:

$$q(\bold{x}_t | \bold{x}_0) = \text{Cat}(\alpha_t \bold{x}_0 + (1 - \alpha_t) \bold{1}_K)$$

where $\bold{1}_K \in \mathbb{R}^K$ is a vector with all entries being $1/K$, and $\alpha_t$ decreases from $\alpha_0 = 1$ to $\alpha_T = 0$. Then we define $q(\bold{x}_{t-1} | \bold{x}_t , \bold{x}_0)$ as the following mixture distribution:

$$q(\bold{x}_{t-1} | \bold{x}_t , \bold{x}_0) = \begin{cases} \text{Cat}(\bold{x}_t) & \text{with probability } \sigma_t \\ \text{Cat}(\bold{x}_0) & \text{with probability } \alpha_{t-1} - \sigma_t \alpha_t \\ \text{Cat}(\bold{1}_K) & \text{with probability } (1 - \alpha_{t-1}) - (1 - \alpha_t) \sigma_t \end{cases}$$

or equivalently:

$$q(\bold{x}_{t-1} | \bold{x}_t , \bold{x}_0) = \text{Cat}\big(\sigma_t \bold{x}_t + (\alpha_{t-1} - \sigma_t \alpha_t ) \bold{x}_0 + ((1 - \alpha_{t-1}) - (1 - \alpha_t) \sigma_t) \bold{1}_K\big)$$

This definition is consistent with how we defined $q(\bold{x}_t | \bold{x}_0)$.

Why consistent?

Plugging $\bold{x}_t = \alpha_t \bold{x}_0 + (1 - \alpha_t) \bold{1}_K$ (the probability vector of $\bold{x}_t$) into $\sigma_t \bold{x}_t + (\alpha_{t-1} - \sigma_t \alpha_t ) \bold{x}_0 + ((1 - \alpha_{t-1}) - (1 - \alpha_t) \sigma_t) \bold{1}_K$, we get:

$$\begin{align*} & \sigma_t \bold{x}_t + (\alpha_{t-1} - \sigma_t \alpha_t ) \bold{x}_0 + ((1 - \alpha_{t-1}) - (1 - \alpha_t) \sigma_t) \bold{1}_K \\ =\ & \sigma_t [\alpha_t \bold{x}_0 + (1 - \alpha_t) \bold{1}_K] + (\alpha_{t-1} - \sigma_t \alpha_t ) \bold{x}_0 + ((1 - \alpha_{t-1}) - (1 - \alpha_t) \sigma_t) \bold{1}_K \\ =\ & \sigma_t \alpha_t \bold{x}_0 + \sigma_t(1 - \alpha_t) \bold{1}_K + \alpha_{t-1}\bold{x}_0 - \sigma_t \alpha_t \bold{x}_0 + (1 - \alpha_{t-1})\bold{1}_K - (1 - \alpha_t) \sigma_t \bold{1}_K \\ =\ & \alpha_{t-1} \bold{x}_0 + (1 - \alpha_{t-1}) \bold{1}_K \end{align*}$$

End
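
A quick numpy check of this consistency argument, with arbitrary values of $\alpha_{t-1} > \alpha_t$ and a $\sigma_t$ small enough that all three mixture weights are non-negative:

```python
import numpy as np

K = 5
x0 = np.eye(K)[2]                  # one-hot observation
ones_K = np.full(K, 1.0 / K)       # uniform vector 1_K

alpha_t, alpha_tm1, sigma_t = 0.4, 0.7, 0.2   # arbitrary illustrative values

# Probability vector of q(x_t | x_0).
q_t = alpha_t * x0 + (1 - alpha_t) * ones_K

# Marginalize the mixture q(x_{t-1} | x_t, x_0) over x_t ~ q(x_t | x_0)
# by plugging the probability vector of x_t into the mixture expression.
q_tm1 = (sigma_t * q_t
         + (alpha_tm1 - sigma_t * alpha_t) * x0
         + ((1 - alpha_tm1) - (1 - alpha_t) * sigma_t) * ones_K)

# Closed-form marginal q(x_{t-1} | x_0).
expected = alpha_tm1 * x0 + (1 - alpha_tm1) * ones_K

print(np.allclose(q_tm1, expected))   # True
```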

Similarly, we can define the reverse process $p_{\theta}(\bold{x}_{t-1} | \bold{x}_t)$ as:

$$p_{\theta}(\bold{x}_{t-1} | \bold{x}_t) = \text{Cat} \left( \sigma_t \bold{x}_t + (\alpha_{t-1} - \sigma_t \alpha_t) f_{\theta}^{(t)}(\bold{x}_t) + [(1 - \alpha_{t-1}) - (1 - \alpha_t) \sigma_t ] \bold{1}_K \right)$$

where $f_{\theta}^{(t)}(\bold{x}_t)$ maps $\bold{x}_t$ to a $K$-dimensional vector. As $(1 - \alpha_{t-1}) - (1 - \alpha_t) \sigma_t \rightarrow 0$, the sampling process becomes less stochastic.

The objective will be the KL divergence:

$$D_{\text{KL}}(q(\bold{x}_{t-1} | \bold{x}_t, \bold{x}_0) \,||\, p_{\theta} (\bold{x}_{t-1} | \bold{x}_t))$$

which is simply the KL divergence between two categoricals. Since the KL divergence is convex, we can apply Jensen's inequality and obtain an upper bound on the objective:

$$\begin{align*} & D_{\text{KL}}(q(\bold{x}_{t-1} | \bold{x}_t, \bold{x}_0) \,||\, p_{\theta} (\bold{x}_{t-1} | \bold{x}_t)) \\ \leq\ & \sigma_t D_{\text{KL}}(\text{Cat}(\bold{x}_t) \,||\, \text{Cat}(\bold{x}_t)) + (\alpha_{t-1} - \sigma_t \alpha_t) D_{\text{KL}}( \text{Cat}(\bold{x}_0) \,||\, \text{Cat}( f_{\theta}^{(t)}(\bold{x}_t))) + [(1 - \alpha_{t-1}) - (1 - \alpha_t) \sigma_t] D_{\text{KL}}(\text{Cat}(\bold{1}_K) \,||\, \text{Cat}(\bold{1}_K)) \\ =\ & 0 + (\alpha_{t-1} - \sigma_t \alpha_t) D_{\text{KL}}( \text{Cat}(\bold{x}_0) \,||\, \text{Cat}( f_{\theta}^{(t)}(\bold{x}_t))) + 0 \\ =\ & (\alpha_{t-1} - \sigma_t \alpha_t) D_{\text{KL}}( \text{Cat}(\bold{x}_0) \,||\, \text{Cat}( f_{\theta}^{(t)}(\bold{x}_t))) \end{align*}$$

This also shows that the choice of $\sigma_t$ does not affect the objective (up to re-weighting).

4. The Core Sampling Method of DDIM (and why)

Non-Markovian Forward Processes

Consider a family $\mathcal{Q}$ of inference distributions, indexed by a real vector $\sigma \in \mathbb{R}^T_{\geq 0}$:

$$q_{\sigma}(\bold{x}_{1:T} | \bold{x}_0) \coloneqq q_{\sigma}(\bold{x}_T | \bold{x}_0) \prod_{t=2}^T q_{\sigma}(\bold{x}_{t-1} | \bold{x}_t, \bold{x}_0)$$

where $q_{\sigma}(\bold{x}_T | \bold{x}_0) = \mathcal{N} (\sqrt{\alpha_T} \bold{x}_0 , (1 - \alpha_T) \bold{I})$ and, for all $t > 1$,

$$q_{\sigma}(\bold{x}_{t-1} | \bold{x}_{t}, \bold{x}_0) = \mathcal{N} \left( \sqrt{\alpha_{t-1}} \bold{x}_0 + \sqrt{1 - \alpha_{t-1} - \sigma_t^2} \cdot \frac{\bold{x}_t - \sqrt{\alpha_t} \bold{x}_0}{\sqrt{1 - \alpha_t}}, \ \sigma_t^2 \bold{I} \right)$$

The mean function is chosen to ensure that $q_{\sigma} (\bold{x}_t | \bold{x}_0) = \mathcal{N} (\sqrt{\alpha_t} \bold{x}_0, (1 - \alpha_t) \bold{I})$ for all $t$, so that the joint inference distribution has the desired marginals (it matches the original DDPM's overall step, so we don't need to train the model again).
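
A quick Monte Carlo sanity check of this marginal-matching property (arbitrary values of $\alpha_t$, $\alpha_{t-1}$, $\sigma_t$ chosen for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 3, 200_000

alpha_t, alpha_tm1, sigma_t = 0.5, 0.8, 0.1   # illustrative values
x0 = rng.normal(size=d)

# Sample x_t from the marginal q_sigma(x_t | x_0).
x_t = np.sqrt(alpha_t) * x0 + np.sqrt(1 - alpha_t) * rng.normal(size=(n, d))

# Sample x_{t-1} from q_sigma(x_{t-1} | x_t, x_0) as defined above.
mean = (np.sqrt(alpha_tm1) * x0
        + np.sqrt(1 - alpha_tm1 - sigma_t**2)
        * (x_t - np.sqrt(alpha_t) * x0) / np.sqrt(1 - alpha_t))
x_tm1 = mean + sigma_t * rng.normal(size=(n, d))

# The resulting marginal should be N(sqrt(alpha_{t-1}) x0, (1 - alpha_{t-1}) I).
print(x_tm1.mean(axis=0), np.sqrt(alpha_tm1) * x0)   # means agree
print(x_tm1.var(axis=0), 1 - alpha_tm1)              # variances agree (≈ 0.2)
```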

How is the sampling process defined?

(1) DDPM Sampling: $\bold{x}_{t} \rightarrow \bold{x}_{t-1} \rightarrow \dots \rightarrow \bold{x}_{0}$

$$p(\bold{x}_{t-1} | \bold{x}_t, \bold{x}_0) = \frac{p(\bold{x}_t | \bold{x}_{t-1}, \bold{x}_0) \cdot p(\bold{x}_{t-1} | \bold{x}_0)}{p(\bold{x}_t | \bold{x}_0)}$$

The overall steps $p(\bold{x}_t | \bold{x}_0)$ and $p(\bold{x}_{t-1} | \bold{x}_0)$ are fixed, and $p(\bold{x}_t | \bold{x}_{t-1}, \bold{x}_0)$ can be simplified by the Markov property to $p(\bold{x}_t | \bold{x}_{t-1})$.

(2) Skip Steps (non-Markovian)

Consider $s < k$; the sampling distribution can be written as:

$$p(\bold{x}_{s} | \bold{x}_k, \bold{x}_0) = \frac{p(\bold{x}_k | \bold{x}_{s}, \bold{x}_0) \cdot p(\bold{x}_{s} | \bold{x}_0)}{p(\bold{x}_k | \bold{x}_0)}$$

The overall steps are still known and do not change:

$$p(\bold{x}_t | \bold{x}_0) = \mathcal{N}(\sqrt{\bar{\alpha}_t} \bold{x}_0, (1 - \bar{\alpha}_t) \bold{I}) \qquad \text{(in DDPM notation)}$$

Now the problem is how to choose the sampling distribution (by the method of undetermined coefficients).

We still set $p(\bold{x}_s | \bold{x}_k, \bold{x}_0)$ to be a Gaussian. Assume that the mean function is a linear combination of $\bold{x}_k$ and $\bold{x}_0$:

$$p(\bold{x}_s | \bold{x}_k, \bold{x}_0) = \mathcal{N} (n \bold{x}_0 + m\bold{x}_k, \ \sigma^2 \bold{I})$$

where $n, m, \sigma^2$ are unknowns to be determined.

Reparameterization gives:

$$\bold{x}_s = (n \bold{x}_0 + m \bold{x}_k) + \sigma \epsilon', \quad \text{where } \epsilon' \sim \mathcal{N}(\bold{0}, \bold{I})$$

Reparameterization of the overall step gives:

$$\bold{x}_t = \sqrt{\bar{\alpha}_t} \bold{x}_0 + \sqrt{1 - \bar{\alpha}_t} \epsilon'', \quad \text{where } \epsilon'' \sim \mathcal{N}(\bold{0}, \bold{I})$$

Therefore we can derive:

$$\begin{align*}\bold{x}_s &= (n \bold{x}_0 + m\bold{x}_k) + \sigma \epsilon' \\ &= (n\bold{x}_0 + m(\sqrt{\bar{\alpha}_k} \bold{x}_0 + \sqrt{1 - \bar{\alpha}_k} \epsilon'')) + \sigma \epsilon' \\ &= (n + m \sqrt{\bar{\alpha}_k}) \bold{x}_0 + [m\sqrt{1 - \bar{\alpha}_k} \epsilon'' + \sigma \epsilon'] \\ &= (n + m \sqrt{\bar{\alpha}_k}) \bold{x}_0 + \sqrt{m^2 (1 - \bar{\alpha}_k) + \sigma^2}\, \epsilon , \quad \text{where } \epsilon \sim \mathcal{N}(\bold{0}, \bold{I})\end{align*}$$

The overall step also gives $\bold{x}_s$:

$$\bold{x}_s = \sqrt{\bar{\alpha}_s} \bold{x}_0 + \sqrt{1 - \bar{\alpha}_s} \epsilon'', \quad \text{where } \epsilon'' \sim \mathcal{N}(\bold{0}, \bold{I})$$

Using the method of undetermined coefficients, we have:

$$\begin{cases} n + m\sqrt{\bar{\alpha}_k} = \sqrt{\bar{\alpha}_s} \\ m^2(1 - \bar{\alpha}_k) + \sigma^2 = 1 - \bar{\alpha}_s \end{cases}$$

Here we choose $\sigma$ as the free variable and solve for $n$ and $m$:

$$\begin{cases} m = \frac{\sqrt{1 - \bar{\alpha}_s - \sigma^2}}{\sqrt{1 - \bar{\alpha}_k}} \\ n = \sqrt{\bar{\alpha}_s} - \sqrt{1 - \bar{\alpha}_s - \sigma^2} \cdot \frac{\sqrt{\bar{\alpha}_k}}{\sqrt{1 - \bar{\alpha}_k}} \end{cases}$$

So we have:

$$\begin{align*} p(\bold{x}_s | \bold{x}_k, \bold{x}_0) &= \mathcal{N} \left( \left(\sqrt{\bar{\alpha}_s} - \sqrt{1 - \bar{\alpha}_s - \sigma^2} \cdot \frac{\sqrt{\bar{\alpha}_k}}{\sqrt{1 - \bar{\alpha}_k}}\right) \bold{x}_0 + \frac{\sqrt{1 - \bar{\alpha}_s - \sigma^2}}{\sqrt{1 - \bar{\alpha}_k}}\bold{x}_k, \ \sigma^2 \bold{I} \right) \\ &= \mathcal{N}\left(\sqrt{\bar{\alpha}_s} \bold{x}_0 + \sqrt{1 - \bar{\alpha}_s - \sigma^2} \cdot \frac{\bold{x}_k - \sqrt{\bar{\alpha}_k} \bold{x}_0}{\sqrt{1 - \bar{\alpha}_k}}, \ \sigma^2 \bold{I}\right) \end{align*}$$

Changing from DDPM notation to DDIM notation ($\alpha_t$ in DDIM denotes $\bar{\alpha}_t$ in DDPM):

$$p(\bold{x}_s | \bold{x}_k, \bold{x}_0) = \mathcal{N}\left(\sqrt{\alpha_s} \bold{x}_0 + \sqrt{1 - \alpha_s - \sigma^2} \cdot \frac{\bold{x}_k - \sqrt{\alpha_k} \bold{x}_0}{\sqrt{1 - \alpha_k}}, \ \sigma^2 \bold{I}\right)$$

When $\sigma = 0$, this is DDIM. When $\sigma = \sigma_{\text{DDPM}}$, this is DDPM. One can also interpolate in between.
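
A quick symbolic check of the undetermined-coefficients solution for $m$ and $n$ above, using sympy (symbol and variable names are my own):

```python
import sympy as sp

a_s, a_k, sigma = sp.symbols('abar_s abar_k sigma', positive=True)
n, m = sp.symbols('n m', real=True)

# Match the coefficient of x0 and the total variance, then solve for n and m
# with sigma left as a free variable.
eqs = [sp.Eq(n + m * sp.sqrt(a_k), sp.sqrt(a_s)),
       sp.Eq(m**2 * (1 - a_k) + sigma**2, 1 - a_s)]
solutions = sp.solve(eqs, [n, m], dict=True)

# The system has two roots (m positive or negative); the positive-m root is the
# one used in the derivation above.
for sol in solutions:
    print(sp.simplify(sol[m]), '|', sp.simplify(sol[n]))
```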

5. DDIM’s Overall Step is Unchanged

(Appendix B, Lemma 1)

Lemma 1. For $q_{\sigma}(\bold{x}_{1:T} | \bold{x}_0) \coloneqq q_{\sigma}(\bold{x}_T | \bold{x}_0) \prod_{t=2}^T q_{\sigma}(\bold{x}_{t-1} | \bold{x}_t, \bold{x}_0)$ and $q_{\sigma}(\bold{x}_{t-1} | \bold{x}_{t}, \bold{x}_0) = \mathcal{N} \left(\sqrt{\alpha_{t-1}} \bold{x}_0 + \sqrt{1 - \alpha_{t-1} - \sigma_t^2} \cdot \frac{\bold{x}_t - \sqrt{\alpha_t} \bold{x}_0}{\sqrt{1 - \alpha_t}}, \ \sigma_t^2 \bold{I}\right)$, we have:

$$q_{\sigma} (\bold{x}_t | \bold{x}_0) = \mathcal{N}(\sqrt{\alpha_t} \bold{x}_0, (1 - \alpha_t) \bold{I})$$

Proof:

We prove this by induction on $t$, going from $T$ down to $1$.

By the definition of $q_{\sigma}(\bold{x}_T | \bold{x}_0)$, the case $t = T$ holds.

Assume that for some $t \leq T$, $q_{\sigma} (\bold{x}_t | \bold{x}_0) = \mathcal{N}(\sqrt{\alpha_t} \bold{x}_0, (1 - \alpha_t) \bold{I})$ holds.

First, we have:

$$q_{\sigma} (\bold{x}_{t-1} | \bold{x}_0) = \int_{\bold{x}_t} q_{\sigma} (\bold{x}_{t} | \bold{x}_0)\, q_{\sigma} (\bold{x}_{t-1} | \bold{x}_t, \bold{x}_0)\, d \bold{x}_t$$

and

$$q_{\sigma} (\bold{x}_{t} | \bold{x}_0) = \mathcal{N}(\sqrt{\alpha_t} \bold{x}_0, (1 - \alpha_t) \bold{I})$$

$$q_{\sigma}(\bold{x}_{t-1} | \bold{x}_{t}, \bold{x}_0) = \mathcal{N} \left(\sqrt{\alpha_{t-1}} \bold{x}_0 + \sqrt{1 - \alpha_{t-1} - \sigma_t^2} \cdot \frac{\bold{x}_t - \sqrt{\alpha_t} \bold{x}_0}{\sqrt{1 - \alpha_t}}, \ \sigma_t^2 \bold{I}\right)$$
📖

Bishop (2006) (2.115)

If we have a Gaussian distribution for $\bold{x}$ and a conditional Gaussian distribution for $\bold{y}$ given $\bold{x}$ in the form

$$p(\bold{x}) = \mathcal{N}(\bold{x} \,|\, \mu, \Lambda^{-1})$$

$$p(\bold{y} | \bold{x}) = \mathcal{N} (\bold{y} \,|\, \bold{Ax} + \bold{b}, \bold{L}^{-1})$$

the marginal distribution of y\bold{y} and the conditional distribution of x\bold{x} given y\bold{y} are given by:

$$p(\bold{y}) = \mathcal{N} (\bold{y} \,|\, \bold{A \mu} + \bold{b}, \ \bold{L}^{-1} + \bold{A \Lambda^{-1} A^{T}} )$$

$$p(\bold{x} | \bold{y}) = \mathcal{N}(\bold{x} \,|\, \Sigma \{\bold{A^T L (y - b) + \Lambda \mu}\}, \ \Sigma)$$

where $\Sigma = (\Lambda + \bold{A^T L A})^{-1}$.

From Bishop (2006) (2.115), we have that $q_{\sigma} (\bold{x}_{t-1} | \bold{x}_0)$ is Gaussian, denoted as $\mathcal{N}(\mu_{t-1}, \Sigma_{t-1})$,

where

$$\begin{align*} \mu_{t-1} &= \sqrt{\alpha_{t-1}} \bold{x}_0 + \sqrt{1 - \alpha_{t-1} - \sigma_t^2} \cdot \frac{\sqrt{\alpha_t} \bold{x}_0 - \sqrt{\alpha_t} \bold{x}_0}{\sqrt{1 - \alpha_t}} \\ &= \sqrt{\alpha_{t-1}} \bold{x}_0 \end{align*}$$

and

$$\Sigma_{t-1} = \sigma_t^2 \bold{I} + \frac{1 - \alpha_{t-1} - \sigma_t^2}{1 - \alpha_t} (1 - \alpha_t) \bold{I} = (1 - \alpha_{t-1}) \bold{I}$$

Therefore, $q_{\sigma} (\bold{x}_{t-1} | \bold{x}_0) = \mathcal{N}(\sqrt{\alpha_{t-1}} \bold{x}_0, (1 - \alpha_{t-1}) \bold{I})$, which completes the induction step.

End of Proof.

6. Generative Process and Unified Variational Inference Objective

Generative Process

Intuitively, given a noisy observation $\bold{x}_t$, we first make a prediction of the corresponding $\bold{x}_0$, and then use it to obtain a sample $\bold{x}_{t-1}$ through the reverse conditional distribution $q_{\sigma}(\bold{x}_{t-1} | \bold{x}_t, \bold{x}_0)$ defined above.

For some $\bold{x}_0 \sim q(\bold{x}_0)$ and $\epsilon_t \sim \mathcal{N}(\bold{0}, \bold{I})$, $\bold{x}_t$ can be obtained via the overall step of the forward process:

$$\bold{x}_t = \sqrt{\alpha_t} \bold{x}_0 + \sqrt{1 - \alpha_t} \epsilon_t, \qquad \text{where } \epsilon_t \sim \mathcal{N}(\bold{0}, \bold{I})$$

The model $\epsilon_{\theta}^{(t)}(\bold{x}_t)$ then attempts to predict $\epsilon_t$ without knowledge of $\bold{x}_0$. By rewriting the overall step above, we get the prediction of $\bold{x}_0$ given $\bold{x}_t$:

$$f_{\theta}^{(t)} (\bold{x}_t) \coloneqq (\bold{x}_t - \sqrt{1 - \alpha_t} \cdot \epsilon_{\theta}^{(t)} (\bold{x}_t)) / \sqrt{\alpha_t}$$

We can then define a fixed prior $p_{\theta}(\bold{x}_T) = \mathcal{N}(\bold{0}, \bold{I})$ for the generative process and

$$p_{\theta}^{(t)} (\bold{x}_{t-1} | \bold{x}_t) = \begin{cases} \mathcal{N}(f_{\theta}^{(1)} (\bold{x}_1), \sigma_1^2 \bold{I}) & \text{if } t = 1 \\ q_{\sigma}(\bold{x}_{t-1} | \bold{x}_t, f_{\theta}^{(t)} (\bold{x}_t)) & \text{otherwise} \end{cases}$$

Variational Inference Objective

$\theta$ can be optimized with the following variational inference objective (which is a functional over $\epsilon_{\theta}$):

$$\begin{align*} J_{\sigma} (\epsilon_{\theta}) & \coloneqq \mathbb{E}_{\bold{x}_{0:T} \sim q_{\sigma}(\bold{x}_{0:T})} [\log q_{\sigma} (\bold{x}_{1:T} | \bold{x}_0) - \log p_{\theta}(\bold{x}_{0:T})] \\ & = \mathbb{E}_{\bold{x}_{0:T} \sim q_{\sigma}(\bold{x}_{0:T})} \left[ \log q_{\sigma}(\bold{x}_T | \bold{x}_0) + \sum_{t=2}^T \log q_{\sigma}(\bold{x}_{t-1} | \bold{x}_t, \bold{x}_0) - \sum_{t=1}^T \log p_{\theta}^{(t)}(\bold{x}_{t-1} | \bold{x}_t) - \log p_{\theta}(\bold{x}_T) \right] \end{align*}$$

where $q_{\sigma} (\bold{x}_{1:T} | \bold{x}_0)$ is factorized as $q_{\sigma}(\bold{x}_{1:T} | \bold{x}_0) \coloneqq q_{\sigma}(\bold{x}_T | \bold{x}_0) \prod_{t=2}^T q_{\sigma}(\bold{x}_{t-1} | \bold{x}_t, \bold{x}_0)$ and $p_{\theta}(\bold{x}_{0:T})$ is factorized as $p_{\theta} (\bold{x}_{0:T}) \coloneqq p_{\theta} (\bold{x}_T) \prod_{t=1}^T p_{\theta}^{(t)}(\bold{x}_{t-1} | \bold{x}_t)$.

It can be proved that $J_{\sigma}$ is equivalent to $L_{\gamma}$ (the DDPM objective) for certain weights $\gamma$. Therefore, the model $\epsilon_{\theta}$ does not require retraining, and we can simply reuse a model trained with the DDPM objective.

Theorem 1. For all $\sigma > 0$, there exist $\gamma \in \mathbb{R}_{>0}^{T}$ and $C \in \mathbb{R}$ such that $J_{\sigma} = L_{\gamma} + C$.

Proof:

From the definition of JσJ_{\sigma}:

$$\begin{align*} J_{\sigma} & \equiv \mathbb{E}_{\bold{x}_{0:T} \sim q(\bold{x}_{0:T})} \left[ \log q_{\sigma}(\bold{x}_T | \bold{x}_0) + \sum_{t=2}^T \log q_{\sigma}(\bold{x}_{t-1} | \bold{x}_t, \bold{x}_0) - \sum_{t=1}^T \log p_{\theta}^{(t)} (\bold{x}_{t-1} | \bold{x}_t) \right] \\ & \equiv \mathbb{E}_{\bold{x}_{0:T} \sim q(\bold{x}_{0:T})} \left[ \sum_{t=2}^T D_{\text{KL}}(q_{\sigma}(\bold{x}_{t-1}|\bold{x}_t, \bold{x}_0) \,||\, p_{\theta}^{(t)}(\bold{x}_{t-1} | \bold{x}_t)) - \log p_{\theta}^{(1)}(\bold{x}_0 | \bold{x}_1) \right] \end{align*}$$

where we use $\equiv$ to denote "equal up to a value that does not depend on $\epsilon_{\theta}$ (but may depend on $q_{\sigma}$)".

How is the $\equiv$ derived?

Consider the part $\sum_{t=2}^T \log \frac{q_{\sigma}(\bold{x}_{t-1} | \bold{x}_t, \bold{x}_0)}{p_{\theta}^{(t)}(\bold{x}_{t-1} | \bold{x}_t)}$:

$$\begin{align*} &\mathbb{E}_{\bold{x}_{0:T} \sim q_{\sigma}(\bold{x}_{0:T})}\left[ \sum_{t=2}^T \log \frac{q_{\sigma}(\bold{x}_{t-1} | \bold{x}_t, \bold{x}_0)}{p_{\theta}^{(t)}(\bold{x}_{t-1} | \bold{x}_t)} \right] \\ =\ & \sum_{t=2}^T \mathbb{E}_{(\bold{x}_0, \bold{x}_{t-1}, \bold{x}_t) \sim q_{\sigma}(\bold{x}_0, \bold{x}_{t-1}, \bold{x}_t)} \left[\log \frac{q_{\sigma}(\bold{x}_{t-1} | \bold{x}_t, \bold{x}_0)}{p_{\theta}^{(t)}(\bold{x}_{t-1} | \bold{x}_t)} \right] \\ =\ & \sum_{t=2}^T \int q_{\sigma}(\bold{x}_0, \bold{x}_{t-1}, \bold{x}_t) \cdot \log \frac{q_{\sigma}(\bold{x}_{t-1} | \bold{x}_t, \bold{x}_0)}{p_{\theta}^{(t)}(\bold{x}_{t-1} | \bold{x}_t)} \, d\bold{x}_0\, d\bold{x}_{t-1}\, d\bold{x}_t \\ =\ & \sum_{t=2}^T \int q_{\sigma}(\bold{x}_t, \bold{x}_0) \cdot q_{\sigma}(\bold{x}_{t-1} | \bold{x}_t, \bold{x}_0) \cdot \log \frac{q_{\sigma}(\bold{x}_{t-1} | \bold{x}_t, \bold{x}_0)}{p_{\theta}^{(t)}(\bold{x}_{t-1} | \bold{x}_t)} \, d\bold{x}_0\, d\bold{x}_{t-1}\, d\bold{x}_t \\ =\ & \sum_{t=2}^T \mathbb{E}_{(\bold{x}_0, \bold{x}_t) \sim q_{\sigma}(\bold{x}_0, \bold{x}_t)} \left[ \int q_{\sigma}(\bold{x}_{t-1} | \bold{x}_t, \bold{x}_0) \cdot \log \frac{q_{\sigma}(\bold{x}_{t-1} | \bold{x}_t, \bold{x}_0)}{p_{\theta}^{(t)}(\bold{x}_{t-1} | \bold{x}_t)} \, d \bold{x}_{t-1} \right] \\ =\ & \sum_{t=2}^T \mathbb{E}_{(\bold{x}_0, \bold{x}_t) \sim q_{\sigma}(\bold{x}_0, \bold{x}_t)} \left[ D_{\text{KL}}(q_{\sigma}(\bold{x}_{t-1} | \bold{x}_t, \bold{x}_0) \,||\, p_{\theta}^{(t)}(\bold{x}_{t-1}|\bold{x}_t) )\right] \\ =\ & \mathbb{E}_{\bold{x}_{0:T} \sim q_{\sigma}(\bold{x}_{0:T})} \left[ \sum_{t=2}^T D_{\text{KL}}(q_{\sigma}(\bold{x}_{t-1} | \bold{x}_t, \bold{x}_0) \,||\, p_{\theta}^{(t)}(\bold{x}_{t-1}|\bold{x}_t) )\right] \end{align*}$$

For $t > 1$, by the definition of $p_{\theta}^{(t)}(\bold{x}_{t-1} | \bold{x}_t)$:

$$\begin{align*} &\mathbb{E}_{\bold{x}_0, \bold{x}_t \sim q_{\sigma}(\bold{x}_0, \bold{x}_t)} \left[ D_{\text{KL}}(q_{\sigma}(\bold{x}_{t-1} | \bold{x}_t, \bold{x}_0) \,||\, p_{\theta}^{(t)}(\bold{x}_{t-1}|\bold{x}_t) )\right] \\ =\ & \mathbb{E}_{\bold{x}_0, \bold{x}_t \sim q_{\sigma}(\bold{x}_0, \bold{x}_t)} \left[ D_{\text{KL}}(q_{\sigma}(\bold{x}_{t-1} | \bold{x}_t, \bold{x}_0) \,||\, q_{\sigma}(\bold{x}_{t-1}|\bold{x}_t, f_{\theta}^{(t)}(\bold{x}_t)) )\right] \\ \equiv\ & \mathbb{E}_{\bold{x}_0, \bold{x}_t \sim q_{\sigma}(\bold{x}_0, \bold{x}_t)} \left[ \frac{||\bold{x}_0 - f_{\theta}^{(t)}(\bold{x}_t)||_2^2}{2 \sigma_t^2} \right] \\ =\ & \mathbb{E}_{\bold{x}_0 \sim q(\bold{x}_0),\ \epsilon \sim \mathcal{N}(\bold{0}, \bold{I}),\ \bold{x}_t=\sqrt{\alpha_t}\bold{x}_0 + \sqrt{1 - \alpha_t}\epsilon} \left[ \frac{||\frac{(\bold{x}_t - \sqrt{1 - \alpha_t}\epsilon)}{\sqrt{\alpha_t}} - \frac{(\bold{x}_t - \sqrt{1 - \alpha_t}\epsilon_{\theta}^{(t)}(\bold{x}_t))}{\sqrt{\alpha_t}}||_2^2}{2 \sigma_t^2} \right] \\ =\ & \mathbb{E}_{\bold{x}_0 \sim q(\bold{x}_0),\ \epsilon \sim \mathcal{N}(\bold{0}, \bold{I}),\ \bold{x}_t=\sqrt{\alpha_t}\bold{x}_0 + \sqrt{1 - \alpha_t}\epsilon}\left[ \frac{||\epsilon - \epsilon_{\theta}^{(t)}(\bold{x}_t)||_2^2}{2d\sigma_t^2 \alpha_t} \right] \end{align*}$$

where $d$ is the dimension of $\bold{x}_0$.

Line 3 "$\equiv$":

Recall that the KL divergence between two Gaussian distributions is:

$$D_{\text{KL}} (\mathcal{N}(\bold{x}; \mu_{x}, \Sigma_x) \,||\, \mathcal{N}(\bold{y}; \mu_{y}, \Sigma_y)) = \frac{1}{2} \left[ \log \frac{|\Sigma_y|}{|\Sigma_x|} - d + \text{tr}(\Sigma_y^{-1} \Sigma_x) + (\mu_y - \mu_x)^T \Sigma_y^{-1} (\mu_y - \mu_x) \right]$$

When $\Sigma_x = \Sigma_y = \sigma_t^2 \bold{I}$, this simplifies to:

$$D_{\text{KL}} (\mathcal{N}(\bold{x}; \mu_{x}, \Sigma_x) \,||\, \mathcal{N}(\bold{y}; \mu_{y}, \Sigma_y)) = \frac{1}{2 \sigma_t^2} ||\mu_x - \mu_y||_2^2$$

The mean of $q_{\sigma}(\bold{x}_{t-1} | \bold{x}_t, \bold{x}_0)$ can be written as $n \cdot \bold{x}_t + m \cdot \bold{x}_0$, where $n$ and $m$ are known constants. Then,

$$\begin{align*} & \mathbb{E}_{\bold{x}_0, \bold{x}_t \sim q_{\sigma}(\bold{x}_0, \bold{x}_t)} \left[ D_{\text{KL}}(q_{\sigma}(\bold{x}_{t-1} | \bold{x}_t, \bold{x}_0) \,||\, q_{\sigma}(\bold{x}_{t-1}|\bold{x}_t, f_{\theta}^{(t)}(\bold{x}_t)) )\right] \\ =\ & \mathbb{E}_{\bold{x}_0, \bold{x}_t \sim q_{\sigma}(\bold{x}_0, \bold{x}_t)} \left[ m^2 \cdot \frac{1}{2 \sigma_t^2} ||\bold{x}_0 - f_{\theta}^{(t)}(\bold{x}_t)||_2^2 \right] \\ \equiv\ & \mathbb{E}_{\bold{x}_0, \bold{x}_t \sim q_{\sigma}(\bold{x}_0, \bold{x}_t)} \left[ \frac{1}{2 \sigma_t^2} ||\bold{x}_0 - f_{\theta}^{(t)}(\bold{x}_t)||_2^2 \right] \end{align*}$$

(The two Gaussians differ only in their $\bold{x}_0$-argument, so $\mu_x - \mu_y = m(\bold{x}_0 - f_{\theta}^{(t)}(\bold{x}_t))$; dropping the constant factor $m^2$ only re-scales the per-step weight, which is harmless here because Theorem 1 only claims equivalence to $L_{\gamma}$ for some $\gamma$.)
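
As a sanity check on the equal-covariance KL formula used here, a quick Monte Carlo comparison (dimensions and parameters are arbitrary):

```python
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(0)
d, sigma2 = 3, 0.25

mu_x, mu_y = rng.normal(size=d), rng.normal(size=d)
p = multivariate_normal(mean=mu_x, cov=sigma2 * np.eye(d))
q = multivariate_normal(mean=mu_y, cov=sigma2 * np.eye(d))

# Monte Carlo estimate of KL(p || q) = E_p[log p(x) - log q(x)].
x = p.rvs(size=200_000, random_state=42)
kl_mc = np.mean(p.logpdf(x) - q.logpdf(x))

# Closed form for equal isotropic covariances.
kl_closed = np.sum((mu_x - mu_y) ** 2) / (2 * sigma2)

print(kl_mc, kl_closed)   # agree up to Monte Carlo error
```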

Question: why does a $d$ suddenly appear in the denominator in Line 5, and why does the $\sqrt{1 - \alpha_t}$ in the numerator disappear? (Expanding the norm in Line 4 actually gives $\frac{1 - \alpha_t}{\alpha_t}||\epsilon - \epsilon_{\theta}^{(t)}(\bold{x}_t)||_2^2$, so the weight would arguably be $\frac{1 - \alpha_t}{2\sigma_t^2 \alpha_t}$; the $\frac{1}{2d\sigma_t^2\alpha_t}$ written here follows the paper's appendix. Either way, the discrepancy only changes the positive weight $\gamma_t$, so the conclusion of Theorem 1 is unaffected.)

For $t = 1$:

$$\begin{align*} &\mathbb{E}_{\bold{x}_0, \bold{x}_1 \sim q_{\sigma}(\bold{x}_0, \bold{x}_1)} \left[ -\log p_{\theta}^{(1)} (\bold{x}_0 | \bold{x}_1) \right] \\ =\ & \mathbb{E}_{\bold{x}_0, \bold{x}_1 \sim q_{\sigma}(\bold{x}_0, \bold{x}_1)} \left[ \text{constant} + \frac{||\bold{x}_0 - f_{\theta}^{(1)}(\bold{x}_1)||_2^2}{2\sigma_1^2} \right] \\ \equiv\ & \mathbb{E}_{\bold{x}_0, \bold{x}_1 \sim q_{\sigma}(\bold{x}_0, \bold{x}_1)} \left[ \frac{||\bold{x}_0 - f_{\theta}^{(1)}(\bold{x}_1)||_2^2}{2\sigma_1^2} \right] \\ =\ & \mathbb{E}_{\bold{x}_0 \sim q(\bold{x}_0),\ \epsilon \sim \mathcal{N}(\bold{0}, \bold{I}),\ \bold{x}_1=\sqrt{\alpha_1} \bold{x}_0 + \sqrt{1 - \alpha_1}\epsilon} \left[ \frac{||\frac{\bold{x}_1 - \sqrt{1 - \alpha_1}\epsilon}{\sqrt{\alpha_1}} - \frac{\bold{x}_1 - \sqrt{1 - \alpha_1}\epsilon_{\theta}^{(1)}(\bold{x}_1)}{\sqrt{\alpha_1}}||_2^2}{2\sigma_1^2} \right] \\ =\ & \mathbb{E}_{\bold{x}_0 \sim q(\bold{x}_0),\ \epsilon \sim \mathcal{N}(\bold{0}, \bold{I}),\ \bold{x}_1=\sqrt{\alpha_1} \bold{x}_0 + \sqrt{1 - \alpha_1}\epsilon} \left[ \frac{||\epsilon - \epsilon_{\theta}^{(1)}(\bold{x}_1)||_2^2}{2d\sigma_1^2 \alpha_1} \right] \end{align*}$$

In the last line, the $d$ appears and the $\sqrt{1 - \alpha_1}$ disappears again (the same remark as above applies: this only affects the weight $\gamma_1$).

Therefore, when $\gamma_t = 1 / (2 d \sigma_t^2 \alpha_t)$ for all $t \in \{1, \dots, T\}$, we have

$$J_{\sigma}(\epsilon_{\theta}) \equiv \sum_{t=1}^T \frac{1}{2d\sigma_t^2 \alpha_t} \mathbb{E}\left[ ||\epsilon_{\theta}^{(t)}(\bold{x}_t) - \epsilon_t||_2^2 \right] = L_{\gamma}(\epsilon_{\theta})$$

for all $\epsilon_{\theta}$.

End of Proof.

7. Sampling from Generalized Generative Processes

IDEA: Use pretrained DDPM models as the solutions to the new objectives, and focus on finding a generative process that is better at producing samples by changing $\sigma$.

Denoising Diffusion Implicit Models

One can generate a sample $\bold{x}_{t-1}$ from a sample $\bold{x}_t$ via:

$$\bold{x}_{t-1} = \sqrt{\alpha_{t-1}} \underbrace{\left( \frac{\bold{x}_t - \sqrt{1 - \alpha_t} \epsilon_{\theta}^{(t)}(\bold{x}_t)}{\sqrt{\alpha_t}} \right)}_{\text{"predicted } \bold{x}_0 \text{"}} + \underbrace{\sqrt{1 - \alpha_{t-1} - \sigma_t^2} \cdot \epsilon_{\theta}^{(t)} (\bold{x}_t)}_{\text{"direction pointing to } \bold{x}_t \text{"}} + \underbrace{\sigma_t \epsilon_t}_{\text{random noise}}$$

where $\epsilon_t \sim \mathcal{N}(\bold{0}, \bold{I})$ is standard Gaussian noise independent of $\bold{x}_t$, and we define $\alpha_0 \coloneqq 1$.
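
A direct numpy transcription of this update (the function and argument names are my own; `eps_pred` stands for the trained network's output $\epsilon_{\theta}^{(t)}(\bold{x}_t)$):

```python
import numpy as np

def generalized_step(x_t, eps_pred, alpha_t, alpha_prev, sigma_t, rng):
    """One step x_t -> x_{t-1} of the generalized sampler above.

    alpha_t / alpha_prev are the cumulative alphas (DDIM notation);
    sigma_t = 0 recovers the deterministic DDIM update.
    """
    x0_pred = (x_t - np.sqrt(1 - alpha_t) * eps_pred) / np.sqrt(alpha_t)   # "predicted x0"
    dir_xt = np.sqrt(1 - alpha_prev - sigma_t**2) * eps_pred               # "direction pointing to x_t"
    noise = sigma_t * rng.normal(size=x_t.shape) if sigma_t > 0 else 0.0   # random noise
    return np.sqrt(alpha_prev) * x0_pred + dir_xt + noise
```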

Different $\sigma$ values result in different generative processes, all using the same model $\epsilon_{\theta}$, so retraining the model is unnecessary. When $\sigma_t = \sqrt{(1 - \alpha_{t-1}) / (1 - \alpha_t)} \sqrt{1 - \alpha_t / \alpha_{t-1}}$ for all $t$, the forward process becomes Markovian and the generative process becomes a DDPM.

Another special case is $\sigma_t = 0$ for all $t$. The forward process $q_{\sigma}(\bold{x}_t | \bold{x}_{t-1}, \bold{x}_0) = \frac{q_{\sigma}(\bold{x}_{t-1} | \bold{x}_{t}, \bold{x}_0)\, q_{\sigma}(\bold{x}_t | \bold{x}_0)}{q_{\sigma}(\bold{x}_{t-1} | \bold{x}_0)}$ is now deterministic given $\bold{x}_{t-1}$ and $\bold{x}_0$, except for $t = 1$ (because $q_{\sigma}(\bold{x}_1 | \bold{x}_0)$ cannot be derived from the above Bayes' rule). The generative process also becomes deterministic because the coefficient of the random noise $\epsilon_{t}$ becomes zero.

The model with $\sigma_t = 0$ is named the denoising diffusion implicit model (DDIM).

More generally, we can define $\sigma_t = \eta \cdot \sigma_{\text{DDPM}}$. When $\eta = 1$ it is DDPM; when $\eta = 0$ it is DDIM.

Accelerated Generation Processes

As the denoising objective $L_{1}$ does not depend on the specific forward procedure as long as $q_{\sigma}(\bold{x}_t | \bold{x}_0)$ is fixed, we may consider forward processes with length smaller than $T$, which accelerates the corresponding generative processes without having to train a different model.

Consider a sub-sequence $\tau = \{ \tau_1 , \dots, \tau_S \}$ of $\{ 1, \dots, T \}$. A new forward process is defined over $\bold{x}_{\tau_1}, \dots, \bold{x}_{\tau_S}$ such that $q(\bold{x}_{\tau_i} | \bold{x}_0) = \mathcal{N}(\sqrt{\alpha_{\tau_i}}\bold{x}_0, (1 - \alpha_{\tau_i}) \bold{I})$. The generative process now samples latent variables according to reversed($\tau$), which is termed the (sampling) trajectory.

Now we show that the training objective is still unchanged so that we don’t need to retrain the model.

Consider the inference process to be factored as:

$$q_{\sigma, \tau}(\bold{x}_{1:T} | \bold{x}_0) = q_{\sigma, \tau}(\bold{x}_{\tau_S} | \bold{x}_0) \prod_{i=1}^{S} q_{\sigma, \tau}(\bold{x}_{\tau_{i-1}} | \bold{x}_{\tau_i}, \bold{x}_0) \prod_{t \in \bar{\tau}} q_{\sigma, \tau}(\bold{x}_t | \bold{x}_0)$$

where $\tau$ is a sub-sequence of $[1, \dots, T]$ of length $S$ with $\tau_S = T$, and $\bar{\tau} \coloneqq \{ 1, \dots, T \} \backslash \tau$ is its complement. We define:

$$q_{\sigma, \tau}(\bold{x}_t | \bold{x}_0) = \mathcal{N}(\sqrt{\alpha_t} \bold{x}_0, (1 - \alpha_t) \bold{I}) \qquad \forall t \in \bar{\tau} \cup \{ T \}$$

$$q_{\sigma, \tau}(\bold{x}_{\tau_{i-1}} | \bold{x}_{\tau_i}, \bold{x}_0) = \mathcal{N} \left( \sqrt{\alpha_{\tau_{i-1}}}\bold{x}_0 + \sqrt{1 - \alpha_{\tau_{i-1}} - \sigma_{\tau_i}^2} \cdot \frac{\bold{x}_{\tau_i} - \sqrt{\alpha_{\tau_i}}\bold{x}_0}{\sqrt{1 - \alpha_{\tau_i}}}, \ \sigma_{\tau_i}^2 \bold{I} \right) \quad \forall i \in [S]$$

where coefficients are chosen such that:

$$q_{\sigma, \tau}(\bold{x}_{\tau_i} | \bold{x}_0) = \mathcal{N}(\sqrt{\alpha_{\tau_i}} \bold{x}_0, (1 - \alpha_{\tau_i})\bold{I}) \quad \forall i \in [S]$$

i.e., the “marginals” match.

The corresponding “generative process” is defined as:

$$p_{\theta}(\bold{x}_{0:T}) \coloneqq \underbrace{p_{\theta}(\bold{x}_T)\prod_{i=1}^S p_{\theta}^{(\tau_i)}(\bold{x}_{\tau_{i-1}} | \bold{x}_{\tau_i})}_{\text{used to produce samples}} \times \underbrace{\prod_{t \in \bar{\tau}} p_{\theta}^{(t)} (\bold{x}_0 | \bold{x}_t)}_{\text{in variational objective}}$$

where only part of the model is used to produce samples. The conditionals are:

$$p_{\theta}^{(\tau_i)}(\bold{x}_{\tau_{i-1}} | \bold{x}_{\tau_i}) = q_{\sigma, \tau} (\bold{x}_{\tau_{i-1}} | \bold{x}_{\tau_i}, f_{\theta}^{(\tau_i)}(\bold{x}_{\tau_i})) \quad \text{if } i \in [S],\ i > 1$$

$$p_{\theta}^{(t)}(\bold{x}_0 | \bold{x}_t) = \mathcal{N}(f_{\theta}^{(t)}(\bold{x}_t), \sigma_t^2 \bold{I}) \quad \text{otherwise,}$$

where we leverage $q_{\sigma, \tau}(\bold{x}_{\tau_{i-1}} | \bold{x}_{\tau_i}, \bold{x}_0)$ as part of the inference process. The resulting variational objective becomes (define $\bold{x}_{\tau_{S + 1}} = \emptyset$ for conciseness):

$$\begin{align*} J(\epsilon_{\theta}) &= \mathbb{E}_{\bold{x}_{0:T} \sim q_{\sigma, \tau}(\bold{x}_{0:T})}[\log q_{\sigma, \tau}(\bold{x}_{1:T} | \bold{x}_0) - \log p_{\theta}(\bold{x}_{0:T})] \\ &= \mathbb{E}_{\bold{x}_{0:T} \sim q_{\sigma, \tau}(\bold{x}_{0:T})} \left[ \sum_{t \in \bar{\tau}} D_{\text{KL}}(q_{\sigma, \tau} (\bold{x}_t | \bold{x}_0) \,||\, p_{\theta}^{(t)}(\bold{x}_0 | \bold{x}_t)) + \sum_{i=1}^S D_{\text{KL}}(q_{\sigma, \tau}(\bold{x}_{\tau_{i-1}} | \bold{x}_{\tau_i}, \bold{x}_0) \,||\, p_{\theta}^{(\tau_i)}(\bold{x}_{\tau_{i-1}} | \bold{x}_{\tau_i})) \right] \end{align*}$$

where each KL divergence is between two Gaussians with variance independent of $\theta$. A similar argument to the proof used in Theorem 1 shows that the variational objective $J$ can also be converted to an objective of the form $L_{\gamma}$. (Not fully understood yet 😢)

8. Relevance to Neural ODE

The sampling equation is given by:

$$\bold{x}_{t-1} = \sqrt{\alpha_{t-1}} \underbrace{\left( \frac{\bold{x}_t - \sqrt{1 - \alpha_t} \epsilon_{\theta}^{(t)}(\bold{x}_t)}{\sqrt{\alpha_t}} \right)}_{\text{"predicted } \bold{x}_0 \text{"}} + \underbrace{\sqrt{1 - \alpha_{t-1} - \sigma_t^2} \cdot \epsilon_{\theta}^{(t)} (\bold{x}_t)}_{\text{"direction pointing to } \bold{x}_t \text{"}} + \underbrace{\sigma_t \epsilon_t}_{\text{random noise}}$$

The corresponding ODE is:

$$d \bar{\bold{x}}(t) = \epsilon_{\theta}^{(t)} \left( \frac{\bar{\bold{x}}(t)}{\sqrt{\sigma^2(t) + 1}} \right) d \sigma(t)$$

where $\sigma = \sqrt{1 - \alpha} / \sqrt{\alpha}$ and $\bar{\bold{x}} = \bold{x} / \sqrt{\alpha}$.

We want to derive the ODE form of this sampling equation.

We also want to prove that the ODE form of DDIM is equivalent to the ODE form of VE-SDE.

Proposition 1. The ODE form of DDIM with the optimal model $\epsilon_{\theta}^{(t)}$ has an equivalent probability flow ODE corresponding to the "Variance-Exploding" SDE in Song et al. (2020).

Proof.

We consider $t$ as a continuous, independent "time" variable and $\bold{x}$ and $\alpha$ as functions of $t$. Let's consider a reparametrization between DDIM and the VE-SDE by introducing the variables $\bar{\bold{x}}$ and $\sigma$:

$$\bar{\bold{x}} (t) = \bar{\bold{x}}(0) + \sigma(t)\epsilon , \qquad \epsilon \sim \mathcal{N}(\bold{0}, \bold{I})$$

for $t \in [0, \infty)$ and an increasing continuous function $\sigma: \mathbb{R}_{\geq 0} \rightarrow \mathbb{R}_{\geq 0}$ with $\sigma(0) = 0$.

We can then define $\alpha(t)$ and $\bold{x}(t)$ corresponding to the DDIM case as:

$$\bar{\bold{x}} (t) = \frac{\bold{x}(t)}{\sqrt{\alpha(t)}}$$

$$\sigma(t) = \sqrt{\frac{1 - \alpha(t)}{\alpha(t)}}$$

This also means that:

$$\bold{x}(t) = \frac{\bar{\bold{x}}(t)}{\sqrt{\sigma^2(t) + 1}}$$

$$\alpha(t) = \frac{1}{1 + \sigma^2(t)}$$

which establishes a bijection between $(\bold{x}, \alpha)$ and $(\bar{\bold{x}}, \sigma)$.

Given that $\alpha(0) = 1$ and the overall step $\bold{x}_t = \sqrt{\alpha_t} \bold{x}_0 + \sqrt{1 - \alpha_t} \epsilon$, we have:

$$\frac{\bold{x}(t)}{\sqrt{\alpha(t)}} = \frac{\bold{x}(0)}{\sqrt{\alpha(0)}} + \sqrt{\frac{1 - \alpha(t)}{\alpha(t)}} \epsilon, \quad \epsilon \sim \mathcal{N}(\bold{0}, \bold{I})$$

which can be reparametrized into a form that is consistent with VE-SDE:

$$\bar{\bold{x}}(t) = \bar{\bold{x}}(0) + \sigma(t) \epsilon$$

Now we derive the ODE forms for both DDIM and VE-SDE and show that they are equivalent.

ODE form for DDIM

The sampling equation can be rewritten as ($\sigma_t = 0$ for DDIM):

$$\frac{\bold{x}_{t - \Delta t}}{\sqrt{\alpha_{t - \Delta t}}} = \frac{\bold{x}_t}{\sqrt{\alpha_t}} + \left( \sqrt{\frac{1 - \alpha_{t - \Delta t}}{\alpha_{t - \Delta t}}} - \sqrt{\frac{1 - \alpha_t}{\alpha_t}} \right) \epsilon_{\theta}^{(t)} (\bold{x}_t)$$

which is equivalent to:

$$\bar{\bold{x}}(t - \Delta t) = \bar{\bold{x}}(t) + (\sigma(t - \Delta t) - \sigma(t)) \cdot \epsilon_{\theta}^{(t)}(\bold{x}_t)$$

Subtracting $\bar{\bold{x}}(t)$, dividing both sides by $-\Delta t$, and letting $\Delta t \rightarrow 0$, we have:

$$\frac{\text{d} \bar{\bold{x}} (t)}{\text{d} t} = \frac{\text{d} \sigma(t)}{\text{d} t}\, \epsilon_{\theta}^{(t)} \left( \frac{\bar{\bold{x}}(t)}{\sqrt{\sigma^2(t) + 1}} \right)$$
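
Numerically, a single Euler step of this ODE in $\sigma$ reproduces the DDIM update exactly. A small numpy check (the $\epsilon$-model is a dummy function and the $\alpha$ values are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)

def fake_eps(x):                        # stand-in for the trained eps_theta
    return np.tanh(x)

alpha_t, alpha_prev = 0.5, 0.6          # arbitrary cumulative alphas
x_t = rng.normal(size=4)

# DDIM update (sigma_t = 0).
x0_pred = (x_t - np.sqrt(1 - alpha_t) * fake_eps(x_t)) / np.sqrt(alpha_t)
x_ddim = np.sqrt(alpha_prev) * x0_pred + np.sqrt(1 - alpha_prev) * fake_eps(x_t)

# One Euler step of d x_bar = eps_theta(x_bar / sqrt(sigma^2 + 1)) d sigma,
# using x_bar = x / sqrt(alpha) and sigma = sqrt((1 - alpha) / alpha).
sig_t = np.sqrt((1 - alpha_t) / alpha_t)
sig_prev = np.sqrt((1 - alpha_prev) / alpha_prev)
x_bar = x_t / np.sqrt(alpha_t)
x_bar_prev = x_bar + (sig_prev - sig_t) * fake_eps(x_bar / np.sqrt(sig_t**2 + 1))
x_euler = x_bar_prev * np.sqrt(alpha_prev)    # map back from x_bar to x

print(np.allclose(x_ddim, x_euler))     # True
```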

ODE form for VE-SDE

Define $p_t(\bar{\bold{x}})$ as the data distribution perturbed with Gaussian noise of variance $\sigma^2(t)$. The probability flow ODE for the VE-SDE is defined as:

$$\text{d} \bar{\bold{x}} = -\frac{1}{2} g(t)^2 \nabla_{\bar{\bold{x}}} \log p_t(\bar{\bold{x}})\, \text{d} t$$

where $g(t) = \sqrt{\frac{\text{d} \sigma^2(t)}{\text{d} t}}$ is the diffusion coefficient, and $\nabla_{\bar{\bold{x}}} \log p_t (\bar{\bold{x}})$ is the score of $p_t$.

The $\sigma(t)$-perturbed score function $\nabla_{\bar{\bold{x}}} \log p_t(\bar{\bold{x}})$ is also the minimizer of the denoising score matching objective:

$$\nabla_{\bar{\bold{x}}} \log p_t(\bar{\bold{x}}) = \argmin_{g_t} \mathbb{E}_{\bold{x}(0) \sim q(\bold{x}), \epsilon \sim \mathcal{N}(\bold{0}, \bold{I})} [||g_t(\bar{\bold{x}}) + \epsilon / \sigma(t)||_2^2]$$

Given the assumption that $\epsilon_{\theta}^{(t)}$ is the optimal model, we have:

$$\nabla_{\bar{\bold{x}}} \log p_t(\bar{\bold{x}}) = - \frac{\epsilon_{\theta}^{(t)}(\bold{x}_t)}{\sigma(t)} = -\frac{\epsilon_{\theta}^{(t)}\left(\frac{\bar{\bold{x}}(t)}{\sqrt{\sigma^2(t) + 1}}\right)}{\sigma(t)}$$

Plugging this into the probability flow ODE for the VE-SDE:

$$\begin{align*} \text{d} \bar{\bold{x}}(t) &= \frac{1}{2} \frac{\text{d}\sigma^2(t)}{\text{d}t} \frac{\epsilon_{\theta}^{(t)}\left(\frac{\bar{\bold{x}}(t)}{\sqrt{\sigma^2(t) + 1}}\right)}{\sigma(t)} \text{d} t \\ \Leftrightarrow \frac{\text{d} \bar{\bold{x}}(t)}{\text{d} t} &= \frac{\text{d} \sigma(t)} {\text{d} t}\epsilon_{\theta}^{(t)}\left(\frac{\bar{\bold{x}}(t)}{\sqrt{\sigma^2(t) + 1}}\right) \end{align*}$$

This is exactly the same as the ODE form for DDIM. In both cases the initial conditions are $\bar{\bold{x}} (T) \sim \mathcal{N}(\bold{0}, \sigma^2(T) \bold{I})$, so the resulting ODEs are identical.

By discretizing the above equation we have:

$$\begin{align*} \bar{\bold{x}}(t - \Delta t) - \bar{\bold{x}}(t) &= \frac{1}{2} (\sigma^2(t - \Delta t) - \sigma^2(t)) \cdot \frac{1}{\sigma(t)} \cdot \epsilon_{\theta}^{(t)} \left(\frac{\bar{\bold{x}}(t)}{\sqrt{\sigma^2(t) + 1}}\right) \\ \frac{\bold{x}_{t - \Delta t}}{\sqrt{\alpha_{t - \Delta t}}} - \frac{\bold{x}_t}{\sqrt{\alpha_t}} &= \frac{1}{2} \left( \frac{1 - \alpha_{t - \Delta t}}{\alpha_{t - \Delta t}} - \frac{1 - \alpha_t}{\alpha_t} \right) \cdot \sqrt{\frac{\alpha_t}{1 - \alpha_t}}\, \epsilon_{\theta}^{(t)} (\bold{x}_t) \\ \Rightarrow \qquad \frac{\bold{x}_{t - \Delta t}}{\sqrt{\alpha_{t - \Delta t}}} &= \frac{\bold{x}_t}{\sqrt{\alpha_t}} + \frac{1}{2} \left( \frac{1 - \alpha_{t - \Delta t}}{\alpha_{t - \Delta t}} - \frac{1 - \alpha_t}{\alpha_t} \right) \cdot \sqrt{\frac{\alpha_t}{1 - \alpha_t}}\, \epsilon_{\theta}^{(t)} (\bold{x}_t) \end{align*}$$

9. Parameter Setting of DDIM

Empirically, DDIM improves over DDPM in two ways:

  1. Generation speeds up by 10x to 100x.
  1. Once the initial $\bold{x}_T$ is fixed, DDIM retains high-level image features across trajectories of different lengths.

No changes are needed to the training procedure. The only change is how samples are produced from the model. This is achieved by controlling $\tau$ (which controls how fast samples are obtained) and $\sigma$ (which interpolates between the deterministic DDIM and the stochastic DDPM).

We consider different sub-sequences $\tau$ of $[1, \dots, T]$ and different hyperparameters $\sigma$ indexed by the elements of $\tau$. To simplify comparisons, we consider $\sigma$ of the form:

$$\sigma_{\tau_i}(\eta) = \eta \sqrt{(1 - \alpha_{\tau_{i-1}}) / (1 - \alpha_{\tau_i})} \sqrt{1 - \alpha_{\tau_i} / \alpha_{\tau_{i-1}}}$$

where $\eta \in \mathbb{R}_{\geq 0}$ is a hyperparameter that can be directly controlled. This recovers the original DDPM generative process when $\eta = 1$ and DDIM when $\eta = 0$.

The sub-sequence $\tau$ is selected given $\text{dim}(\tau) < T$:

  1. $\textbf{Linear}$: $\tau_i = \lfloor ci \rfloor$
  1. $\textbf{Quadratic}$: $\tau_i = \lfloor ci^2 \rfloor$

The constant value $c$ is selected so that the last element $\tau_{-1}$ is close to $T$.
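
Putting the pieces together, here is a compact numpy sketch of accelerated sampling along a linear sub-sequence $\tau$ with the $\eta$-parameterized $\sigma$ above. The schedule, the dummy $\epsilon$-model, and all names are placeholders of my own:

```python
import numpy as np

rng = np.random.default_rng(0)

T, S, eta, d = 1000, 50, 0.0, 4                # eta = 0.0 gives DDIM, 1.0 gives DDPM
alpha = np.linspace(0.9999, 1e-4, T)           # cumulative alphas; assumed schedule

# Linear sub-sequence tau; c is chosen so that tau[-1] is close to T.
c = T / S
tau = (np.arange(1, S + 1) * c).astype(int).clip(1, T)

def fake_eps(x, t):                            # stand-in for the trained model
    return np.tanh(x)

x = rng.normal(size=d)                         # x_T ~ N(0, I)
for i in range(S - 1, -1, -1):                 # iterate over reversed(tau)
    t = tau[i]
    t_prev = tau[i - 1] if i > 0 else 0
    a_t = alpha[t - 1]
    a_prev = alpha[t_prev - 1] if t_prev > 0 else 1.0          # alpha_0 := 1
    sigma_t = eta * np.sqrt((1 - a_prev) / (1 - a_t)) * np.sqrt(1 - a_t / a_prev)
    eps = fake_eps(x, t)
    x0_pred = (x - np.sqrt(1 - a_t) * eps) / np.sqrt(a_t)      # predicted x0
    x = (np.sqrt(a_prev) * x0_pred
         + np.sqrt(1 - a_prev - sigma_t**2) * eps              # direction to x_t
         + sigma_t * rng.normal(size=d))                       # random noise
print(x)   # the generated sample (meaningless here, since fake_eps is a dummy)
```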